Phased chromosome-scale genome assembly of an asexual, allopolyploid root-knot nematode reveals complex subgenomic structure

We present the chromosome-scale genome assembly of the allopolyploid root-knot nematode Meloidogyne javanica. We show that the M. javanica genome is predominantly allotetraploid, comprising two subgenomes, A and B, that most likely originated from hybridisation of two ancestral parental species. The assembly was annotated using full-length non-chimeric transcripts, comparison to reference databases, and ab initio prediction techniques, and the subgenomes were phased using ancestral k-mer spectral analysis. Subgenome B appears to show fission of chromosomal contigs, and while there is substantial synteny between subgenomes, we also identified regions lacking synteny that may have diverged in the ancestral genomes prior to or following hybridisation. This annotated and phased genome assembly forms a significant resource for understanding the origins and genetics of these globally important plant pathogens.

Genomics Ltd), but increased the total number of scaffolds from 37 to 66. Following scaffolding with Hi-C, samba [40] joined some small scaffolds and fragmented the largest scaffold in the assembly, increasing the number of scaffolds from 66 to 69.
The final assembly scaffolds contains 150,545,692 bp with an N50 of 5,793,182 bp, at 30.11% GC content, and overall, 99.96% of reads in our combined PacBio HiFi library map successfully back to the assembly, indicating a high level of completeness (S1 Table).
Of the total assembly, 97.87% was contained in the longest 33 scaffolds (S3 Fig), which ranged from 898 kbp to 9,595 kbp in length (Fig 2A; S6 Table). These 33 scaffolds contained 99.85% of all transcribed gene models. Of the remaining 36 scaffolds, 4 were identified by blobtools [41] as likely contaminants (Arthropoda, Chordata, and Streptophyta;

Core Eukaryotic Gene and Single Universal Copy Ortholog Analysis
Analysis of the final assembly with CEGMA [42]

Coverage and ploidy analysis
Mean coverage depth for all scaffolds -excluding mitochondrial -was 206.6x, falling to 206.1x for the longest 33 scaffolds (Fig 2B). For some scaffolds (2, 9 and 18) coverage depth was significantly in excess of this average, indicating collapsed regions; that is, representation of three or four rather than two copies. Collapse is expected to arise in a polyploid assembly when homoeologous sequences are similar enough at a nucleotide level that they are inferred to be multiple homologous alleles of the same region, which are then collapsed to form the reference assembly copy [8]. Other scaffolds (19,22,24,33) were underrepresented suggesting that they may be present as a single copy.
We then examined the coverage depth frequency distribution for each scaffold (Sup Fig 7). For phased scaffolds representing two identical homologs, the frequency of coverage for all bases in each scaffold would be expected to have a single peak (240x).
However, while predominantly single peaks were seen for under-represented scaffolds thought to be present as a single copy (19,22,24, 33), we observed two peaks for the majority of scaffolds (Fig 3A & C; S7 Fig). Two peaks would be expected if homologs are not identical and similarity is disrupted by indels, with the second peak representing the reduced coverage of the hemizygous region (~120x). For scaffold 2, which is overrepresented in coverage depth, four clear peaks are present (Fig 3) indicating that more than two copies map to the scaffold. This likely resulted from exclusion of the scaffold's homoeolog from the assembly, leading to an assembly collapse as noted above.
Nevertheless, the presence of 4 peaks is consistent with the presence of polymorphisms between homologous copies as well as between homoeologous pairs.

Identification of homoeologous pairs and phasing of subgenomes
Given the diploid nature of the assembly, which represents each subgenome as a single copy, we expected to find scaffolds from these subgenomes present in homoeologous pairs. Through detection of shared orthologs (Supplementary Methods) twenty pairings were identified between the longest 33 scaffolds of the assembly, each sharing between 20 and 351 CDS orthologs (Fig 4, S4 Table). Alternative methods of identifying homoeologous pairs were corroborative (S5 Table). Some pairs were not mutually exclusive, and exhibited CDS links to scaffolds outside of the primary pair, suggesting translocation and syntenic changes between them. All scaffolds that showed two depths of coverage were assigned as homoeologous pairs as expected if each subgenome was heterozygous. Four scaffolds were excluded from pairings including scaffold 2 which showed high amounts of collapse and scaffolds 19, 24, 30, and 31 which are relatively small and/or present as single copies (S7 Fig and 9; S6 Table).
In order to assign contigs to a subgenome, we used a modified version of an ancestral k-mer spectra analysis [45] (Supplementary Methods). This approach is based on the premise that the allopolyploid's subgenomes possess repeats that diverged in the two parental genomes before hybridisation. This results in a distinguishable signature in each subgenome's k-mer spectrum, and allows us to determine the parental species from which a given sequence descends. We successfully phased 85.39% of the genome into A or B subgenomes (Fig 2A; S9 Fig; S6 Table). Scaffold 2, which exhibits extensive assembly collapse, did not phase using these methods. Some smaller scaffolds that did not phase by k-mer based methods were later assigned to a subgenome based on the phase status of their opposing homoeologous scaffold (S6 Table). Of the total length of the final assembly, 39.15% was assigned to subgenome A and 46.24% was assigned to subgenome B, leaving only 14.61% unassigned. Glyphs along top and bottom represent scaffolds assigned to either subgenome A (blue) or subgenome B (red). Green lines mark locations of synteny between transcribed genes identified from mapping of Iso-Seq sequences. Grey lines mark locations of synteny between genes identified through MAKER3 gene prediction. Synteny and collinearity were identified using the MCScan module of JCVI using Iso-Seq informed transcriptional annotation. Scaffolds that could not be assigned to a subgenome are not shown.

Assembly of the allopolyploid genome of M. javanica
We have used long-read sequencing and modern bioinformatic approaches to assemble and phase the allopolyploid genome of the plant pathogenic nematode Meloidogyne javanica. Our goal was to assemble a contiguous diploid assembly for M. javanica, representing the A and B subgenomes separately. Our current diploid assembly (150,545,692 bp) is, as expected for a tetraploid, approximately half the 297 ±27Mb measured by flow cytometry [19,28]. This total length is representative of both A and B subgenomes except for collapsed regions where sequence for both subgenomes A and B is considered together. Because some regions of the assembly have been shown by k-mer profiling and coverage analysis to be present in less than four copies (Fig 1; S8 Fig), splitting the subgenomes into their component copies (A1, A2, B1, B2) to create a tetraploid assembly would not be expected to completely double the length.

Annotation
We annotated our assembly using both ab initio feature prediction algorithms and mapping of the full-length transcript sequences (S3 Table). The total number of genes Additionally, we were also not able to identify a homolog of C. elegans telomerase (trt-1; ACC: NM_001373211.4) in our assembly. This may suggest that non-standard telomere processes might be operating in root-knot nematodes as has been found for fruit flies [49,50].

Genetic variation in M. javanica is dominated by indels
We present the assembly as a diploid representation. Read depth analysis of individual scaffolds indicates that homologs are not identical and indels are frequent.
Additionally some scaffolds appear to be present as a single copy suggesting that one of the homologs may have been lost. We identified many indels that delete or disrupt one or more copies of coding sequences, indicating that M. javanica is no longer genome-wide tetraploid either in copy number or functionality. A higher propensity for indel accumulation has been frequently seen in hybrid parthenogenetic species [51] and partial return to lower ploidy is a characteristic of many polyploids.
We find that 87% of our diploid assembly consists of one copy of either subgenome in homoeologous pairs, with two copies mapping to phased scaffolds and four copies mapping to collapsed regions. We successfully phased much of our assembly into subgenomes using k-mer signatures, enabling for the first time initial genome-wide comparison of MIG A and B subgenomes. Some scaffolds, notably scaffold 2 and some of the short scaffolds, could not be assigned to subgenomes. Scaffold 2 displays four levels of stratification in its coverage (S8 Fig) and four peaks in its depth distribution (Fig 3b) indicating that the four copies (A1+A2 and B1+B2) are almost entirely collapsed into a single scaffold. We suggest that the homoeologous chromosomes represented by scaffold 2 may be too similar in sequence to assign to a subgenome by k-mer analysis due to a possible homogenization event. For the smaller scaffolds, the failure may have been due to their small size because they did not contain enough relevant k-mers. Some of the small scaffolds were later assigned to a subgenome using transcript alignments. No misassembly or erroneous scaffolding of the unphased sequences was detected through either programmatic or manual methods.
The allotetraploid genome of M. javanica demonstrates extensive synteny between the A and B subgenomes with most genic regions represented in four copies. We would expect however that there would be differences between the subgenomes, including indels and other structural variation, as this is typically observed between different species and the MIG have a hybrid origin. In accordance with this, we observed very substantial structural variation between subgenomes A and B, including insertions, deletions, and translocations (Fig 4).

Loss of synteny and fragmentation
We observe regions of the M. javanica genome where synteny between paired chromosomes is disrupted. One reason for this could be that the parental species (A and B) had genomes in which the non-syntenic regions had diverged by translocation, insertion or deletion. Upon hybridisation to create the allopolyploid MIG these diverged regions form the end of synteny blocks. An alternative explanation is that these changes happened after the hybridisation event in the tumultuous process of genome stabilisation immediately following it [52,53]. Hybridisation, polyploidization, and the loss of meiosis are processes often associated with rapid genomic change [54] [51,54,55]  Meloidogyne species, like other nematodes, have holocentric chromosomes (Triantaphyllou, 1985). Genomes with dispersed centromere structure are predicted to better tolerate chromosome fragmentation and fusion [56], as are species with ameiotic mechanisms of reproduction. The observed differences in copy number of chromosomes between isolates of M. javanica may be additional evidence for tolerance of chromosome fragmentation/fusion. Although it is always necessary to consider the possibility of misassembly or rogue scaffolding, we have substantial evidence through manual inspection and contiguous mapping of ultra-long Oxford Nanopore reads that our assembly is correct. reproduces by obligatory mitotic parthenogenesis (apomixis) and this lack of meiotic chromosome pairing may allow greater structural divergence and tolerance for the decay of synteny. We note that genomes from species not able to reproduce by meiosis are currently rare [51] and suggest that much more substantial genomic work on a range of species with different reproductive modes and ploidy levels will be required to reveal the diverse mechanisms shaping these genomes. It is apparent however that the changes surrounding allopolyploidy have contributed to the gene content, heterozygosity, and copy number throughout the M. javanica genome and these processes of locally fixed heterozygosity or rediploidization may contribute extensively to adaptive functional variation [59].

Gene annotation
Iso-Seq reads were mapped to the genome and collapsed using a snakemake

Phasing of subgenomes
Scaffolds were phased into A and B subgenomes using a k-mer-based approach built on the approach taken by [45]. K-mers in the assembly were detected and counted using jellyfish [82], with only k-mers present more than 75 times in the assembly and represented at least twice as often in one subgenome than the other carried forward. These counts were then transformed into binomial distributions. Hierarchical clustering was performed on these sets, creating a dendrogram placing scaffolds into opposing clusters, each cluster representing a subgenome (Supplementary Methods).